Layout Analysis and Content Classification in Digitized Books

Authors

  • Andrea Corbelli
  • Lorenzo Baraldi
  • Fabrizio Balducci
  • Costantino Grana
  • Rita Cucchiara

Abstract

Automatic layout analysis has proven to be extremely important in digitizing large document collections. In this paper we present a mixed approach to layout analysis, introducing an SVM-aided layout segmentation process and a classification process based on local and geometrical features. The final output of the automatic analysis algorithm is a complete and structured annotation in JSON format, which contains the digitized text as well as all references to the illustrations of the input page and can be used by both visualization and annotation interfaces. We evaluate our algorithm on a large dataset built upon the first volume of the “Enciclopedia Treccani”.
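
The abstract does not specify the schema of the JSON annotation, so the following is only a minimal sketch of what such a per-page output might look like. The function name `build_page_annotation`, the keys `text_blocks`, `illustrations`, and `bbox`, and the sample regions are all hypothetical and are not taken from the paper.

```python
import json


def build_page_annotation(page_id, regions):
    """Assemble a JSON-serializable annotation for one page.

    `regions` is a list of dicts with hypothetical keys:
    "kind" ("text" or "illustration"), "bbox" (x, y, width, height)
    and, for text regions, the recognized "text".
    """
    text_blocks = []
    illustrations = []
    for index, region in enumerate(regions):
        x, y, width, height = region["bbox"]
        entry = {
            "id": f"{page_id}-r{index}",
            "bbox": {"x": x, "y": y, "width": width, "height": height},
        }
        if region["kind"] == "text":
            # Text regions carry the digitized text.
            entry["text"] = region.get("text", "")
            text_blocks.append(entry)
        else:
            # Illustrations are only referenced by position.
            illustrations.append(entry)
    return {
        "page": page_id,
        "text_blocks": text_blocks,
        "illustrations": illustrations,
    }


# Made-up regions, standing in for the output of segmentation and classification.
example_regions = [
    {"kind": "text", "bbox": (40, 60, 500, 320), "text": "Sample paragraph."},
    {"kind": "illustration", "bbox": (40, 400, 500, 260)},
]

annotation = build_page_annotation("vol1-p0001", example_regions)
annotation_json = json.dumps(annotation, indent=2, ensure_ascii=False)
print(annotation_json)
```

Keeping text blocks and illustration references in separate lists is one possible design choice; it lets a visualization interface draw illustrations without parsing the text entries.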

Similar resources

INEX 2011 Workshop Pre-proceedings

The goal of the INEX 2011 Book Track is to evaluate approaches for supporting users in reading, searching, and navigating book metadata and full texts of digitized books. The investigation is focused around four tasks: 1) the Social Search for Best Books task aims at comparing traditional and user-generated book metadata for retrieval, 2) the Prove It task evaluates focused retrieval approaches...

Overview of the INEX 2011 Book Track

The goal of the INEX 2011 Book Track is to evaluate approaches for supporting users in reading, searching, and navigating book metadata and full texts of digitized books. The investigation is focused around four tasks: 1) the Social Search for Best Books task aims at comparing traditional and user-generated book metadata for retrieval, 2) the Prove It task evaluates focused retrieval approaches...

Overview of the INEX 2011 Books and Social Search Track

The goal of the INEX 2011 Books and Social Search Track is to evaluate approaches for supporting users in reading, searching, and navigating book metadata and full texts of digitized books. The investigation is focused around four tasks: 1) the Social Search for Best Books task aims at comparing traditional and user-generated book metadata for retrieval, 2) the Prove It task evaluates focused r...

Textbook Content Analysis Based on the Viewpoints of Dentistry Students: The Case of "English for Dentistry Students"

Background and purpose: Recognizing the educational needs of students determines their learning goals and leads to better design of course books and educational materials. In this regard, high-quality textbooks are a real need. This study aimed to analyze the content of the "English for Dentistry Students" course book based on the viewpoints of dental students. Materials and methods: A descriptive ...

Historical Analysis of National Subjective Wellbeing Using Millions of Digitized Books

We present the first attempt to construct a long-run historical measure of subjective wellbeing using language corpora derived from millions of digitized books. While existing measures of subjective wellbeing go back to at most the 1970s, we can go back at least 200 years further using our methods. We analyse...

Publication date: 2016